Lecture 3
This lecture, taught by Prof. Cathy Yi-Hsuan Chen, covers web languages (HTML, XML, and JSON), the task of parsing, and RSS news feeds.
The code can be found on GitHub.
Here is a video from Coursera
Outline
Web language
- HTML (Hypertext Markup Language): the standard markup language for documents designed to be displayed in a web browser. HTML was designed to display data - with focus on how data looks
- XML (eXtensible Markup Language): XML was designed to carry data - with focus on what data is, and XML tags are not predefined like HTML tags are
- JSON (JavaScript Object Notation): JSON is a syntax for storing and exchanging data between a browser and a server
HTML
- Describes the structure of a Web page
- Consists of a series of elements represented by tags
- Elements tell the browser how to display the content
- HTML tags label pieces of content such as "heading", "paragraph", "table", and so on
- Browsers do not display the HTML tags, but use them to render the content of the page
<!DOCTYPE html> <!-- declares the document type -->
<html> <!-- root element of an HTML page -->
  <head> <!-- contains meta information about the document -->
    <title>Page Title</title> <!-- specifies the title of the document -->
  </head>
  <body> <!-- contains the visible page content -->
    <h1>My First Heading</h1> <!-- defines a large heading -->
    <p>My first paragraph.</p> <!-- defines a paragraph -->
  </body>
</html>
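To see how a program, rather than a browser, reads these tags, here is a minimal sketch using Python's built-in `html.parser` module. The class name `TagLister` and the shortened copy of the sample page are illustrative assumptions, not part of the lecture code.

```python
from html.parser import HTMLParser

# Shortened copy of the sample page (annotations removed).
PAGE = """<!DOCTYPE html>
<html>
<head><title>Page Title</title></head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""


class TagLister(HTMLParser):
    """Hypothetical helper: records each text fragment with its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.found = []   # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.found.append((self.stack[-1], text))


parser = TagLister()
parser.feed(PAGE)
parser.close()
for tag, text in parser.found:
    print(tag, "->", text)
```

The tags themselves never appear in the printed text; they only label which piece of content is a title, a heading, or a paragraph.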
XML
- XML language has no predefined tags
- Tags are "invented" by the author of the XML document
- Author must define both the tags and the document structure
<note>
<date>2015-09-01</date>
<hour>08:30</hour>
<to>Tove</to>
<from>Jani</from>
<body>Don't forget me this weekend!</body>
</note>
<employees>
<employee>
<firstName>John</firstName> <lastName>Doe</lastName>
</employee>
<employee>
<firstName>Anna</firstName> <lastName>Smith</lastName>
</employee>
<employee>
<firstName>Peter</firstName> <lastName>Jones</lastName>
</employee>
</employees>
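Because the author invents the tags, a program reading the document has to know their names. The following sketch reads the employees document with `xml.dom.minidom` from the standard library; the helper `text_of` is a hypothetical name introduced only for this illustration.

```python
import xml.dom.minidom

EMPLOYEES_XML = """<employees>
  <employee><firstName>John</firstName> <lastName>Doe</lastName></employee>
  <employee><firstName>Anna</firstName> <lastName>Smith</lastName></employee>
  <employee><firstName>Peter</firstName> <lastName>Jones</lastName></employee>
</employees>"""


def text_of(element, tag):
    """Hypothetical helper: text inside the first <tag> found under element."""
    node = element.getElementsByTagName(tag)[0]
    return node.firstChild.data


dom = xml.dom.minidom.parseString(EMPLOYEES_XML)

# One dict per <employee> element, keyed by the author-defined tag names.
employees = [
    {"firstName": text_of(e, "firstName"), "lastName": text_of(e, "lastName")}
    for e in dom.getElementsByTagName("employee")
]
print(employees)
```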
JSON
- JSON is a syntax for storing and exchanging data
- JSON is text, written with JavaScript object notation
- JSON is a lightweight data-interchange format and is "self-describing"
- We can also convert any JSON received from the server into JavaScript objects, and work with the data as JavaScript objects
JSON Syntax Rules
- Data is in key/value pairs
- Data is separated by commas
- Curly braces hold objects
- Square brackets hold arrays
{"employees":[
{ "firstName":"John", "lastName":"Doe" },
{ "firstName":"Anna", "lastName":"Smith" },
{ "firstName":"Peter", "lastName":"Jones" }
]}
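In Python, the counterparts of JavaScript objects and arrays are dicts and lists. As a small sketch (the variable names are chosen here for illustration), the standard `json` module converts the employees text to Python objects and back:

```python
import json

TEXT = """{"employees":[
  { "firstName":"John", "lastName":"Doe" },
  { "firstName":"Anna", "lastName":"Smith" },
  { "firstName":"Peter", "lastName":"Jones" }
]}"""

data = json.loads(TEXT)   # JSON text -> Python dict containing a list of dicts

# Curly braces became dicts, square brackets became a list.
names = [e["firstName"] + " " + e["lastName"] for e in data["employees"]]
print(names)

back = json.dumps(data["employees"][0])   # Python dict -> JSON text
print(back)
```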
Python Requests Module
Make a request to a web page, and print the response text
import requests
# Sends a GET request to the specified url
x = requests.get('https://w3schools.com/python/demopage.htm')
print(x.text)
Parsing
Parsing is the process of analyzing a string of symbols. The term parsing comes from Latin pars (orationis), meaning part (of speech). Within computational linguistics the term is used to refer to the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree showing their syntactic relation to each other, which may also contain semantic and other information.
Parser
- A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree, abstract syntax tree or other hierarchical structure, giving a structural representation of the input.
- In the case of data languages, a parser is often found as the file reading facility of a program, such as reading in HTML or XML text; these examples are markup languages.
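As an illustration of a parser building a tree, the sketch below uses Python's own `ast` module, which parses Python source rather than a markup language. This choice and the helper name `outline` are assumptions made for the example, not lecture code.

```python
import ast


def outline(node, depth=0):
    """Return the syntax tree as indented lines, one node type per line."""
    lines = ["  " * depth + type(node).__name__]
    for child in ast.iter_child_nodes(node):
        lines.extend(outline(child, depth + 1))
    return lines


# Parse a small arithmetic expression into an abstract syntax tree.
tree = ast.parse("1 + 2 * 3", mode="eval")
tree_lines = outline(tree)
print("\n".join(tree_lines))
```

The multiplication ends up nested under the addition, which shows how the tree records structure that the flat input string only implies.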
Parsing XML
import requests
import xml.dom.minidom  # module for parsing XML
# Daily Treasury bill rates (XML)
response = requests.get(
    "https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Datasets/daily_treas_bill_rates.xml")
content_1 = response.content
dataDOM_1 = xml.dom.minidom.parseString(content_1)  # parse the bytes into a DOM tree
# Google News RSS feed for finance news (RSS is an XML format)
response = requests.get(
    "https://news.google.com/news/rss/headlines/section/q/finance%20news/finance%20news?ned=us&hl=en")
content_2 = response.content
dataDOM_2 = xml.dom.minidom.parseString(content_2)
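Once `parseString` has returned a DOM, elements can be looked up by tag name. The sketch below runs offline on a small hand-written RSS 2.0 document; the sample content is made up for illustration and is not what either URL returns.

```python
import xml.dom.minidom

# Made-up RSS 2.0 sample; real feeds carry more fields per <item>.
SAMPLE_RSS = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample finance feed</title>
    <item><title>Markets open higher</title><link>https://example.com/1</link></item>
    <item><title>Bond yields fall</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

# Same call as for response.content, which is also a bytes object.
dom = xml.dom.minidom.parseString(SAMPLE_RSS)

headlines = []
for item in dom.getElementsByTagName("item"):
    # Search inside each <item>, so the channel-level <title> is skipped.
    title = item.getElementsByTagName("title")[0].firstChild.data
    link = item.getElementsByTagName("link")[0].firstChild.data
    headlines.append((title, link))

for title, link in headlines:
    print(title, "-", link)
```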
Parsing JSON
import requests
import json
import pandas as pd
url = 'http://data.thecrix.de/data/crix.json'
r = requests.get(url)
content = r.content
# json.loads: parse a JSON string (or bytes) into Python objects
js_content = json.loads(content)
for item in js_content:
    print(item)
# make a data frame
data_raw = pd.DataFrame(js_content)
data_raw.set_index(keys='date', inplace=True)
# make a time-series plot
data_raw.plot()
RSS news feed
RSS (originally RDF Site Summary) is a web feed which allows users and applications to access updates to websites in a standardized, computer-readable format. These feeds can, for example, allow a user to keep track of many different websites in a single news aggregator. The news aggregator will automatically check the RSS feed for new content, allowing the list to be automatically passed from website to website or from website to user.
Note: a web feed (or news feed) is a data format used to provide users with frequently updated content.
Financial Times news feed
Please visit the Financial Times RSS feed page and click the business education RSS feed.
import feedparser  # parser for RSS feeds
# retrieve the RSS feed
content = feedparser.parse("https://www.ft.com/?edition=international&format=rss")
# list all titles
print("\nTitles-------------------------\n")
for index, item in enumerate(content.entries):
    print("{0}.{1}".format(index, item["title"]))
# list all descriptions
print("\r\nDescriptions-------------------\r\n")
for index, item in enumerate(content.entries):
    print("{0}.{1}\n".format(index, item["description"]))
Wall Street Journal news feed
Please visit the Wall Street Journal RSS feed page.
Choose the news category of interest, for instance U.S. Business.
import feedparser  # parser for RSS feeds
import pandas as pd
# retrieve the RSS feed for U.S. business news
content = feedparser.parse("https://feeds.a.dj.com/rss/WSJcomUSBusiness.xml")
# list all titles and collect them in a data frame
# (DataFrame.append was removed in pandas 2.0, so build the frame from a list)
titles = []
for index, item in enumerate(content.entries):
    titles.append(item["title"])
    print("{0}.{1}".format(index, item["title"]))
dfData_title = pd.DataFrame({'title': titles})
# list all descriptions and collect them in a data frame
print("\r\nDescriptions-------------------\r\n")
descriptions = []
for index, item in enumerate(content.entries):
    descriptions.append(item["description"])
    print("{0}.{1}\n".format(index, item["description"]))
dfData_des = pd.DataFrame({'description': descriptions})
Coursework
Please search for other news feeds and try to parse one of them yourself.
You can consider the BBC news feed, other news categories in the Wall Street Journal RSS feed, or any other feed worldwide.